# Consider the following libraries for this exercise sheet:
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
# for visualization
library(mlr3viz)
library(ggplot2)

Exercise 11 – Tuning
Exercise 1: Random search
Random search (RS) is a simple yet effective tuning procedure. We will implement RS from scratch to find the optimal degree \(d \in \mathbb{N}\) in a polynomial regression problem.
Consider the following skeleton of pseudo-code:
- What should this algorithm return as a result?
Solution
The minimum required output is the optimal degree \(d^\ast\). (Additional information, such as the associated performance estimate, might enhance user experience in a real-world implementation.)
- What should be the required user input? Add the inputs to the pseudo-code.
Hint: Use a single hold-out split in evaluation.
Solution
We need (at least) the following input:
- A search space for \(d\)
- The number of RS trials (budget)
- Data to train and evaluate the learner on, plus the proportion to use for training (alternatively, we might require separate training and test datasets)
- An evaluation criterion
Let’s add that to the pseudo-code:
Start to implement the main body by
- defining elements that allow you to store the currently optimal candidate,
- performing a holdout split on the data, and
- setting up a loop for evaluation of each candidate.
Solution
Tracking optimal candidates
- We’ll need a variable that is set to the optimal degree in any given iteration, to be updated when a candidate performs better than previous ones.
- Likewise, we’ll need to store the estimated generalization error associated with the optimal candidate so we can compare it to new candidates (we’ll assume the evaluation criterion \(\rho\) is to be minimized, i.e., smaller values mean lower estimated GE).
Holdout split
Simple one-liner.
Loop
- The first thing to do is defining the set of candidate values, which we’ll obtain by drawing as many random samples from the search space as our budget allows.
- Afterwards, we can run a `for` loop over this candidate set.
There’s no official pseudo-code language; you should phrase your code such that a human with knowledge of any suitable programming language can unambiguously translate the pseudo-code into that language.
- Finalize the pseudo-code by adding steps for training & evaluation and a rule to update the optimal candidate.
Solution
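One possible rendering of the complete pseudo-code as runnable Python (a sketch only: the polynomial fitting and evaluation step is replaced by a toy placeholder `evaluate` function, and all names are illustrative):

```python
import random

def holdout_split(data, train_frac, rng):
    """Shuffle the data once and split it into train and test parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(data) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

def random_search(search_space, budget, data, train_frac, evaluate, seed=1):
    """Sample `budget` candidate degrees uniformly from `search_space`,
    estimate each one's generalization error on a holdout split, and
    return the best candidate together with its estimated error."""
    rng = random.Random(seed)  # fixed seed -> reproducible run
    train, test = holdout_split(data, train_frac, rng)
    best_d, best_err = None, float("inf")
    candidates = [rng.choice(search_space) for _ in range(budget)]
    for d in candidates:
        err = evaluate(d, train, test)  # train on `train`, score on `test`
        if err < best_err:              # update the incumbent candidate
            best_d, best_err = d, err
    return best_d, best_err

# toy stand-in for "fit polynomial of degree d, compute test error":
toy_error = lambda d, train, test: (d - 3) ** 2
best, err = random_search(list(range(1, 11)), budget=20,
                          data=list(range(100)), train_frac=0.8,
                          evaluate=toy_error)
```

In a real implementation, `evaluate` would fit the polynomial regression of degree `d` on the training part and return the test error under the chosen evaluation criterion.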
- Describe how you could implement a more flexible resampling strategy.
Solution
A more general way to define the user input might be to accept sets of training and test datasets, respectively. We could then add a `for` loop over those to compute the GE associated with each candidate \(d\).
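Sketched in the same Python style (the `fit_predict_error` argument is a hypothetical placeholder for training and scoring on a single split):

```python
def evaluate_resampled(d, splits, fit_predict_error):
    """Estimate the GE of candidate degree d by averaging its test
    error over a user-supplied list of (train, test) splits."""
    errors = [fit_predict_error(d, train, test) for train, test in splits]
    return sum(errors) / len(errors)

# dummy placeholder: pretend the error grows with d and the test-set size
dummy_error = lambda d, train, test: d + len(test)
avg = evaluate_resampled(2, [([], [0]), ([], [0, 1])], dummy_error)  # (3 + 4) / 2
```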
Exercise 2: Basic tuning techniques
- Explain the difference in implementation between random search (RS) and grid search (GS).
Solution
- The main difference lies in how we obtain the candidate values: RS samples them uniformly from the search space, while GS creates a multi-dimensional grid from the search space and tries all configurations in it.
- In the algorithm from Exercise 1, we’d thus need to adjust the `candidates` \(\leftarrow \dots\) part (and slightly adapt the user input).
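The contrast can be made concrete in a small Python sketch (illustrative, one-dimensional case):

```python
import random

def grid_candidates(lower, upper, budget):
    """GS: deterministic, equidistant candidates covering [lower, upper]."""
    step = (upper - lower) / (budget - 1)
    return [lower + i * step for i in range(budget)]

def random_candidates(lower, upper, budget, seed=1):
    """RS: candidates drawn uniformly at random from [lower, upper]."""
    rng = random.Random(seed)
    return [rng.uniform(lower, upper) for _ in range(budget)]
```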
Consider the following objective function \(c(\lambda)\) with a single hyperparameter \(\lambda \in \mathbb{R}\). The objective is to be minimized. Explain whether RS or GS will be more suited to find the optimal value \(\lambda^\ast\), given
- a search space of \(\lambda \in [0, 50]\), and
- a budget of 6 evaluations.
Solution
- In this (stylized) example, the discretization of GS is quite harmful: Choosing the grid as the lower and upper bounds of the search space plus 4 equidistant values within, as is common, will prevent us from ever exploring promising values for \(\lambda\).
- With RS, every value in \([0, 50]\) is equally likely to be tried, so we have at least a chance to find one of the good values \(\rightsquigarrow\) RS is preferable here.
Consider now a bivariate objective function \(c(\lambda_1, \lambda_2)\) with \(\lambda_1, \lambda_2 \in \mathbb{R}\). The objective is to be minimized. Suppose that \(\lambda_2\) is vastly more important for the objective than \(\lambda_1\).
Visualize the number of different values of \(\lambda_2\) that RS and (exhaustive) GS are expected to explore for a budget \(B\) of 9, 25, 100, 400, 2500 evaluations.
Solution
- Since GS comes with discretization, distributing the budget across both hyperparameters means that \(\lambda_1\) receives \(\sqrt{B}\) candidate values even though it doesn’t matter for the objective. Consequently, we only have \(\sqrt{B}\) distinct values for \(\lambda_2\) left to explore.
- RS, on the other hand, will (in expectation) evaluate \(B\) different values of \(\lambda_2\) for every budget, making it much more efficient in this situation.
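A quick Python computation of the numbers behind the plot (assuming a square grid with \(\sqrt{B}\) values per axis):

```python
import math

budgets = [9, 25, 100, 400, 2500]
# GS on a square grid spends only sqrt(B) distinct values on each axis,
# so the important lambda_2 is explored at just sqrt(B) points ...
gs_distinct = [math.isqrt(b) for b in budgets]
# ... while RS draws lambda_2 afresh in every evaluation, giving
# (almost surely) B distinct values for a continuous hyperparameter.
rs_distinct = budgets
```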
Exercise 3: Interpreting tuning results
In this exercise we will perform hyperparameter optimization (HPO) for the task of classifying the credit risk data with a \(k\)-NN classifier. We use mlr3 (see the tuning chapter of the mlr3 book); the accompanying Jupyter notebook contains a similar case study in Python.
The kknn implementation used by mlr3 contains several hyperparameters, three of which are to be tuned for our prediction:
- \(k\) (number of neighbors)
- `kernel`
- `scale`
Details about the hyperparameters
\(k\): determines the size of the neighborhood and thus influences the locality of the model. Smaller neighborhoods reflect the belief that only very similar (close) neighbors should be allowed to weigh into the prediction of a new observation, and predictions may change strongly for only small changes of the input variables. If \(k\) is chosen too small, we may encounter overfitting. Conversely, larger neighborhoods produce a more global model with larger parts of the input space receiving the same prediction. Setting \(k\) too large may result in underfitting.
`kernel`: determines the importance weights within the \(k\)-neighborhood. Can you guess which kernel corresponds to unweighted \(k\)-NN?
`scale` (logical): defines whether variables should be normalized to equal standard deviation. This is often reasonable to avoid implicit importance weighting through different natural scales (for example, recall that neighborhoods in a bivariate feature space are circular under Euclidean distance; rescaling either dimension will change which observations end up in the neighborhood).
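To see the scaling effect numerically, consider a hypothetical two-feature example (income in currency units, age in years) in plain Python; the nearest neighbor of the query flips once both features are standardized:

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

points = [(30000, 80), (45000, 25)]  # (income, age) of two training points
query = (30500, 26)

# raw scale: income dominates the distance, so point 0 looks closest
nearest_raw = min(range(len(points)), key=lambda i: dist(points[i], query))

# standardize each feature with the training points' mean and sd
def zscore_params(values):
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return m, sd

params = [zscore_params(col) for col in zip(*points)]
scale = lambda p: tuple((v - m) / sd for v, (m, sd) in zip(p, params))
scaled_points = [scale(p) for p in points]
scaled_query = scale(query)

# after scaling, age matters again and point 1 becomes the neighbor
nearest_scaled = min(range(len(points)),
                     key=lambda i: dist(scaled_points[i], scaled_query))
```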
- You receive the following `mlr3` output from tuning \(k\)-NN with random search. Interpret the impact of each hyperparameter on the objective.
Solution
We plot the classification performance in terms of AUC across different choices of each hyperparameter. Let’s look at each one in turn:
- Increasing \(k\) initially leads to an improvement that plateaus after around 50 neighbors. However, there seem to be two quite distinct groups of candidate values.
- Scaling the variables boosts performance quite substantially.
- The choice of the kernel does not seem to have much impact. Again, we see two clusters of candidate values.
Obviously, the interpretability of these plots is limited: we only see marginal effects of individual hyperparameters. Interactions between hyperparameters contribute substantially to the difficulty of the tuning problem. We can see this clearly in the plots for \(k\) and `kernel`, where the two quite distinct patterns correspond to different values of `scale`.
- Now let’s look at the code that generated the above results. Start by defining the `german_credit` task, where you reserve 800 observations for training, and the `kknn` learner (the learner should output probabilities).
Solution
- Set up the search space to tune over using the `ps` function. Include choices for \(k \in \{1, 2, \dots, 100\}\), `scale` \(\in \{\text{yes}, \text{no}\}\), and `kernel` \(\in \{\text{rectangular}, \text{epanechnikov}, \text{gaussian}, \text{optimal}\}\).
Solution
- Define the stopping criterion for random search with a so-called terminator (`trm`). We want the tuning procedure to finish after 200 evaluations or a maximum runtime of 30 seconds.
Hint: You can define this combined terminator via a list of individual terminators.
Solution
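The combined stopping rule (finish as soon as either limit is hit) can be sketched in Python like this; mlr3's terminators work analogously, and all names here are illustrative:

```python
import time

def make_terminator(max_evals, max_secs):
    """Stop when either the evaluation budget or the runtime budget
    is exhausted; the two criteria are combined with OR."""
    start = time.monotonic()
    state = {"evals": 0}

    def record_eval():
        state["evals"] += 1

    def should_stop():
        out_of_evals = state["evals"] >= max_evals
        out_of_time = time.monotonic() - start >= max_secs
        return out_of_evals or out_of_time

    return record_eval, should_stop
```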
- Set up a tuning instance using the function `ti`. This object combines all of the above components. Set the evaluation criterion to AUC.
Solution
- Finally, define the tuner (`tnr`) of type “random_search” and run the optimization. Don’t forget to make your code reproducible.
Solution
Optimization
Visualization
- With the hyperparameter configuration found via HPO, fit the model on all training observations and compute the AUC on your test data.
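As a reminder of the evaluation criterion used throughout this exercise: the AUC equals the probability that a randomly drawn positive case receives a higher score than a randomly drawn negative one (the Mann–Whitney formulation). A minimal from-scratch Python sketch:

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) score pairs ranked
    correctly, counting ties as half a correct ranking."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)
```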